Note: I originally wrote this on Quora in 2019.
Paul, being a fresh statistics graduate, is dying to finally apply a machine learning model in practice. He is hired at a healthcare company and his first task as a working professional is to classify which “Decile” a given patient belongs to.
Rather than researching the context of what a “Decile” is, Paul decides to stay strictly within his area of expertise: data.
Because he doesn’t want to bias his prediction in any way, Paul designates 80% of the data for training and 20% for testing. He then sets the testing data aside so that no information from it leaks into the model he builds on the training data.
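The split Paul describes might look like this with scikit-learn. Everything here is a placeholder, not his actual data: the feature matrix, the labels, and the random seed are all invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical stand-ins for Paul's features and "Decile" labels
X = rng.normal(size=(1000, 5))
y = rng.integers(1, 11, size=1000)  # deciles 1..10

# 80/20 split, held out before any modeling decision is made
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

The key discipline is that the test portion is never touched again until the final prediction.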
Moving forward, everything Paul does will be on the training data with the exception of his final prediction using his first professional ML model.
To get started, Paul decides it would be a good idea to visualize his outcome variable:
He recognizes the unbalanced nature of his data, but has been told to report accuracy specifically and thus disregards other metrics he feels may be more suitable.
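A quick sketch shows both the kind of imbalance Paul saw and why accuracy alone can flatter a model. The decile distribution below is invented for illustration; it is not Paul's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Invented, imbalanced stand-in for Paul's "Decile" outcome
probs = np.array([0.40, 0.20, 0.10, 0.08, 0.06, 0.05, 0.04, 0.03, 0.02, 0.02])
y_train = pd.Series(rng.choice(np.arange(1, 11), size=5000, p=probs), name="Decile")

# Class distribution of the outcome, the first thing Paul visualizes
counts = y_train.value_counts(normalize=True).sort_index()
print(counts)

# On imbalanced data, always guessing the most common decile already
# scores well on accuracy, which is why accuracy alone can mislead
baseline_acc = counts.max()
print(f"majority-class accuracy: {baseline_acc:.2f}")
```

This is the intuition behind the metrics Paul felt "may be more suitable," such as per-class recall or a macro-averaged F1.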
Paul also decides to look through all of the variables he has available to him, and decides that he wants to use the following predictors:
Using these predictors, Paul builds his final model with random forest, an ensemble of decision trees.
He makes his predictions on the test data and is finally ready to assess performance. Before looking at the raw numbers, Paul decides to visualize the results with a heat map of the confusion matrix. Perfect accuracy would appear as a single solid diagonal.
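The fit-predict-evaluate steps above can be sketched with scikit-learn. The data here is made up, since the post does not show Paul's real pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Made-up stand-ins for Paul's predictors and "Decile" labels
X = rng.normal(size=(2000, 5))
y = rng.integers(1, 11, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Random forest: an ensemble of decision trees trained on bootstrap samples
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# Rows = true decile, columns = predicted decile; a perfect model
# would concentrate all counts on the diagonal
cm = confusion_matrix(y_test, pred)
print(cm)
```

From here the heat map Paul drew is one call away, e.g. `seaborn.heatmap(cm)`.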
Paul scratches his head in confusion when he looks at his confusion matrix (no pun intended).
He takes a look at the overall accuracy. 99.9%
Impossible. This is as close to perfect as perfect gets.
Paul consults his mentor, a seasoned data scientist at his company. His mentor laughs and tells him he should look at the relationship between all of his predictor variables and his outcome.
Paul does as suggested and notices something he should have caught before.
The relationship between paid amounts and Decile tends to be quite linear. In fact, the correlation between the two is 0.8396, a strong positive relationship and a giant red flag.
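What Paul saw is easy to reproduce on toy data: if a "Decile" column is cut directly from paid amounts, the correlation between the two is necessarily strong. The paid-amount distribution and column names below are invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Invented paid amounts; the toy "Decile" is cut directly from them,
# mirroring how the real field turned out to be built
paid = rng.gamma(2.0, 1000.0, size=2000)
decile = pd.qcut(paid, 10, labels=False) + 1  # cost deciles 1..10

df = pd.DataFrame({"paid_amount": paid, "Decile": decile})
corr = df["paid_amount"].corr(df["Decile"])
print(f"correlation(paid_amount, Decile) = {corr:.4f}")  # strongly positive
```

Checking each predictor's relationship with the outcome before modeling, as Paul's mentor suggested, is exactly how leaks like this get caught.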
After showing his mentor these results, Paul learns that the Decile field is actually built using the paid amounts — “Decile” represents the decile of costs that a patient belongs to.
Paul remembers from his statistics class that this is a clear case of tautological bias, a form of “cheating” by means of using a different version of the outcome to predict itself.
In this case, the paid amount and the “Decile” fields are more or less the same thing, which explains why using paid amounts to predict “Decile” is cheating. Now that Paul realizes this, he retrains his model without paid amounts.
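The retrain can be sketched on toy data where "Decile" is cut from an invented paid-amount column and the remaining predictors are pure noise. In that setup, dropping the leaky column sends accuracy to roughly chance (one in ten); Paul's real predictors evidently carried some genuine signal, which is how he still reached 50%.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
paid = rng.gamma(2.0, 1000.0, size=2000)     # invented paid amounts
other = rng.normal(size=(2000, 3))           # noise stand-ins for other predictors
decile = pd.qcut(paid, 10, labels=False) + 1  # outcome derived from paid

# Retrain using only the non-leaky predictors
X_train, X_test, y_train, y_test = train_test_split(
    other, decile, test_size=0.2, random_state=42
)
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)

acc = clf.score(X_test, y_test)
print(f"accuracy without paid amounts: {acc:.3f}")  # near chance on this toy data
```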
50% Overall Accuracy. Doesn’t look so good after all.